Abstract

Motivated by applications to modern networking technologies, there has been interest in designing
efficient gossip-based protocols for computing aggregate functions. While gossip-based protocols provide
robustness due to their randomized nature, reducing the message and time complexity of these protocols
is also of paramount importance in the context of resource-constrained networks such as sensor and
peer-to-peer networks.

We present the first provably almost-optimal gossip-based algorithms for aggregate computation that
are both time-optimal and message-optimal. Given an n-node network, our algorithms guarantee that
all the nodes can compute the common aggregates (such as Min, Max, Count, Sum, Average, Rank
etc.) of their values in optimal O(log n) time and using O(n log log n) messages. Our result improves
on the algorithm of Kempe et al. [9] that is time-optimal, but uses O(n log n) messages, as well as on
the algorithm of Kashyap et al. [8] that uses O(n log log n) messages, but is not time-optimal (takes
O(log n log log n) time). Furthermore, we show that our algorithms can be used to improve gossip-based
aggregate computation in sparse communication networks, such as in peer-to-peer networks.

The main technical ingredient of our algorithm is a technique called distributed random ranking 
(DRR) that can be useful in other applications as well. DRR gives an efficient distributed procedure 
to partition the network into a forest of (disjoint) trees of small size. Since the size of each tree is 
small, aggregates within each tree can be efficiently obtained at their respective roots. All the roots then 
perform a uniform gossip algorithm on their local aggregates to reach a distributed consensus on the 
global aggregates. 

Our algorithms are non-address oblivious. In contrast, we show a lower bound of Ω(n log n) on the
message complexity of any address-oblivious algorithm for computing aggregates. This shows that
non-address oblivious algorithms are needed to obtain significantly better message complexity. Our lower
bound holds regardless of the number of rounds taken or the size of the messages used. Our lower bound
is the first non-trivial lower bound for gossip-based aggregate computation and also gives the first formal
proof that computing aggregates is strictly harder than rumor spreading in the address-oblivious model.

Keywords: Gossip-based protocols, aggregate computation, distributed randomized protocols, probabilistic 
analysis, lower bounds. 
1 Introduction



1.1 Background and Previous Work 

Aggregate statistics (e.g., Average, Max/Min, Sum, and Count) are useful for many applications
in networks IH |5] IB HI [HI US l24l. These statistics have to be computed over data stored at
individual nodes. For example, in a peer-to-peer network, the average number of files stored at each node
or the maximum size of files exchanged between nodes is an important statistic needed by system designers
for optimizing overall performance [22, 25]. Similarly, in sensor networks, knowing the average or
maximum remaining battery power among the sensor nodes is a critical statistic. Many research efforts have
been dedicated to developing scalable and distributed algorithms for aggregate computation. Among them,
gossip-based algorithms [1, 2, 4, 8, 9, 12, 16, 17, 20, 23] have recently received significant attention because
of their simplicity of implementation, scalability to large network size, and robustness to frequent network
topology changes. In a gossip-based algorithm, each node exchanges information with a randomly chosen
communication partner in each round. The randomness inherent in gossip-based protocols naturally
provides robustness, simplicity, and scalability [7] IH. We refer to [7, 8, 9] for a detailed discussion of the
advantages of gossip-based computation over centralized and deterministic approaches and their
attractiveness to emerging networking technologies such as peer-to-peer, wireless, and sensor networks. This paper
focuses on designing efficient gossip-based protocols for aggregate computation that have low message and
time complexity. This is especially useful in the context of resource-constrained networks such as sensor and
wireless networks, where reducing message and time complexity can yield significant benefits in terms of
lowering congestion and lengthening node lifetimes.

Much of the early work on gossip focused on using randomized communication for rumor propagation
131 111 im. In particular, Karp et al. [7] gave a rumor spreading algorithm (for spreading a single message
throughout a network of n nodes) that takes O(log n) communication rounds and O(n log log n) messages.
It is easy to establish that Ω(log n) rounds are needed by any gossip-based rumor spreading algorithm (this
bound also holds for gossip-based aggregate computation). They also showed that any rumor spreading
algorithm needs at least Ω(n log log n) messages for a class of randomized gossip-based algorithms referred
to as address-oblivious algorithms [7]. Informally, an algorithm is called address-oblivious if the decision to
send a message to its communication partner in a round does not depend on the partner's address. Karp et
al.'s algorithm is address-oblivious. For non-address oblivious algorithms, they show a lower bound of ω(n)
messages, if the algorithm is allowed only O(log n) rounds.

Kempe et al. [9] were the first to present randomized gossip-based algorithms for computing
aggregates. They analyzed a gossip-based protocol for computing sums, averages, quantiles, and other aggregate
functions. In their scheme for estimating the average, each node selects another random node to which it sends
half of its value; a node on receiving a set of values just adds them to its own halved value. Their protocol
takes O(log n) rounds and uses O(n log n) messages to converge to the true average in an n-node network.
Their protocol is address-oblivious. The work of Kashyap et al. [8] was the first to address the issue of
reducing the message complexity of gossip-based aggregate protocols, even at the cost of increasing the
time complexity. They presented an algorithm that significantly improves over the message complexity of
the protocol of Kempe et al. Their algorithm uses only O(n log log n) messages, but is not time-optimal:
it runs in O(log n log log n) time. Their algorithm achieves this O(log n / log log n) factor reduction in
the number of messages by randomly clustering nodes into groups of size O(log n), selecting a representative
for each group, and then having the group representatives gossip among themselves. Their algorithm is not
address-oblivious. For other related work on gossip-based protocols, we refer to [1, 2] and the references
therein.
Table 1: DRR-gossip vs. other gossip-based algorithms.

Algorithm                | time complexity     | message complexity | address oblivious?
efficient gossip [8]     | O(log n log log n)  | O(n log log n)     | no
uniform gossip [9]       | O(log n)            | O(n log n)         | yes
DRR-gossip [this paper]  | O(log n)            | O(n log log n)     | no



1.2 Our Contributions 

In this paper, we present the first provably almost-optimal gossip-based algorithms for computing various
aggregate functions that improve upon previous results. Given an n-node network, our algorithms guarantee
that all the nodes can compute the common aggregates (such as Min, Max, Count, Sum, Average, Rank
etc.) of their values in optimal O(log n) time and using O(n log log n) messages. Our result (cf. Table 1)
improves on the algorithm of Kempe et al. [9] that is time-optimal, but uses O(n log n) messages, as well
as on the algorithm of Kashyap et al. [8] that uses O(n log log n) messages, but is not time-optimal (takes
O(log n log log n) time).

Our algorithms use a simple scheme called distributed random ranking (DRR) that gives an efficient
distributed protocol to partition the network into a forest of disjoint trees of O(log n) size. Since the size of
each tree is small, aggregates within each tree can be efficiently obtained at their respective roots. All the
roots then perform a uniform gossip algorithm on their local (tree) aggregates to reach a distributed consensus
on the global aggregates. Our idea of forming trees and then doing gossip among the roots of the trees is
similar to the idea of Kashyap et al. The main novelty is that our DRR technique gives a simple and efficient
distributed way of decomposing the network into disjoint trees (groups) which takes only O(log n) rounds
and O(n log log n) messages. This leads to a simpler and faster algorithm than that of Kashyap et al. [8]. The
paper of [20] proposes the following heuristic: divide the network into clusters (called the "bootstrap phase");
aggregate the data within the clusters, at a small subset of nodes within each cluster called clusterheads;
the clusterheads then use the gossip algorithm of Kempe et al. to do inter-cluster aggregation; and,
finally, the clusterheads disseminate the information to all the nodes in their respective clusters. It is not
clear in [20] how to efficiently implement the bootstrap phase of dividing the network into clusters. Also,
only numerical simulation results are presented in [20] to show that their approach gives better complexity
than the algorithm of Kempe et al. It is mentioned without proof that their approach can take O(n log log n)
messages and O(log n) time. Hence, to the best of our knowledge, our work presents the first rigorous
protocol that provably achieves these bounds.

Our second contribution is analyzing gossip-based aggregate computation in sparse networks. In sparse
topologies such as P2P networks, point-to-point communication between all pairs of nodes (as assumed in
gossip-based protocols) may not be a reasonable assumption. On the other hand, a small number of neighbors
in such networks makes it feasible to send one message simultaneously to all neighbors in one round: in
fact, this is a standard assumption in the distributed message passing model [19]. We show how our DRR
technique leads to improved gossip-based aggregate computation in such (arbitrary) sparse networks, e.g.,
P2P network topologies such as Chord [25]. The improvement relies on a key property of the DRR scheme
that we prove: the height of each tree produced by DRR in any arbitrary graph is bounded by O(log n) whp.
In Chord, for example, we show that DRR-gossip takes O(log^2 n) time whp and O(n log n) messages. In
contrast, uniform gossip gives O(log^2 n) rounds and O(n log^2 n) messages.

Our algorithm is non-address oblivious, i.e., some steps use addresses to decide which partner to
communicate with in a round. The time complexity of our algorithm is optimal and the message complexity is within
a factor o(log log n) of the optimal. This is because Karp et al. [7] showed a lower bound of ω(n) for any
non-address oblivious rumor spreading algorithm that operates in O(log n) rounds. (Computing aggregates
is at least as hard as rumor spreading.)

Our third contribution is a non-trivial lower bound of Ω(n log n) on the message complexity of any
address-oblivious algorithm for computing aggregates. This lower bound holds regardless of the number
of rounds taken or the size of the messages (i.e., even assuming that nodes can send arbitrarily long
messages). Our result shows that non-address oblivious algorithms (such as ours) are needed to obtain a
significant improvement in message complexity. We note that this bound is significantly larger than the
O(n log log n) messages shown by Karp et al. for rumor spreading. Thus our result also gives the first formal
proof that computing aggregates is strictly harder than rumor spreading in the address-oblivious model.
Another implication of our result is that the algorithm of Kempe et al. [9] is asymptotically message-optimal
for the address-oblivious model.

Our algorithm, henceforth called DRR-gossip, proceeds in phases. In phase one, every node runs the
DRR scheme to construct a forest of (disjoint) trees. In phase two, each tree computes its local aggregate
(e.g., sum or maximum) by a convergecast process; the local aggregate is obtained at the root. In
phase three, all the roots utilize a suitably modified version of the uniform gossip algorithm of Kempe et
al. [9] to obtain the global aggregate. Finally, if necessary, the roots forward the global aggregate to other
nodes in their trees.

1.3 Organization 

The rest of this paper is organized as follows. The network model is described in Section 2, followed by
sections where each phase of the DRR-gossip algorithm is introduced and analyzed separately. The whole
DRR-gossip algorithm is summarized in Section 3.4. Section 4 applies DRR-gossip to sparse networks.
A lower bound on the message complexity of any address-oblivious algorithm for computing aggregates is
presented and proved in Section 5. Section 1.4 lists the main probabilistic tools used in our analysis: the
Doob martingale and Azuma's inequality. Section 6 concludes with some open questions.

1.4 Probabilistic Preliminaries 

We use Doob martingales extensively in our analysis fT?|. Let $X_0, \ldots, X_n$ be any sequence of random
variables and let $Y$ be any random variable with $E[|Y|] < \infty$. Define the random variable
$Z_i = E[Y \mid X_0, \ldots, X_i]$, $i = 0, 1, \ldots, n$. Then $Z_0, Z_1, \ldots, Z_n$ form a Doob martingale sequence.

We use the martingale inequality known as Azuma's inequality, stated as follows [14]. Let $X_0, X_1, \ldots$
be a martingale sequence such that for each $k$,

$$|X_k - X_{k-1}| \le c_k,$$

where $c_k$ may depend on $k$. Then for all $t \ge 0$ and any $\lambda > 0$,

$$\Pr(|X_t - X_0| \ge \lambda) \le 2 \exp\left(-\frac{\lambda^2}{2 \sum_{k=1}^{t} c_k^2}\right). \qquad (1)$$

We also need the following variant of the Chernoff bound from [18], which works in the case of dependent
indicator random variables that are correlated as defined below.

Lemma 1 ([18]) Let $Z_1, Z_2, \ldots, Z_s \in \{0, 1\}$ be random variables such that for all $l$, and for any
$S_{l-1} \subseteq \{1, \ldots, l-1\}$, $\Pr(Z_l = 1 \mid \bigwedge_{j \in S_{l-1}} Z_j = 1) \le \Pr(Z_l = 1)$. Then for any $\delta > 0$,

$$\Pr\left(\sum_{l=1}^{s} Z_l \ge \mu(1+\delta)\right) \le \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu}, \quad \text{where } \mu = \sum_{l=1}^{s} E[Z_l].$$
Algorithm 1: F = DRR(G)

foreach node i ∈ V do
    choose rank(i) independently and uniformly at random from [0, 1];
    set found = FALSE;  // higher ranked node not yet found
    set parent(i) = NULL;  // initially every node is a root node
    set k = 0;  // number of random nodes probed
    repeat
        sample a node u independently and uniformly at random from V and get its rank;
        if rank(u) > rank(i) then
            set parent(i) = u;
            set found = TRUE;
        end
        set k = k + 1;
    until found == TRUE or k ≥ log n − 1;
    if found == TRUE then
        send a connection message including its identifier, i, to its parent node parent(i);
    end
    collect the connection messages and accordingly construct the set of its children nodes, Child(i);
    if Child(i) = ∅ then
        become a leaf node;
    else
        become an intermediate node;
    end
end



2 Model 

The network consists of a set V of n nodes; each node i has a data value denoted by $v_i$. The goal is to
compute aggregate functions such as Min, Max, Sum, Average etc., of the node values.

The nodes communicate in discrete time-steps referred to as rounds. As in prior works on this problem [7]
[H, we assume that communication rounds are synchronized, and all nodes can communicate simultaneously
in a given round. Each node can communicate with every other node. In a round, each node can choose a
communication partner independently and uniformly at random. A node i is said to call a node j if i chooses
j as a communication partner. (This is known as the random phone call model [7].) Once a call is established,
we assume that information can be exchanged in both directions along the link. In one round, a node can call
only one other node. We assume that nodes have unique addresses. The length of a message is limited to
O(log n + log s), where s is the range of values. It is important to limit the size of messages used in aggregate
computation, as communication bandwidth is often a costly resource in distributed settings. All the above
assumptions are also used in prior works [8, 9]. Similar to the algorithms of [8, 9], our algorithm can tolerate
the following two types of failures: (i) some fraction of nodes may crash initially, and (ii) links are lossy and
messages can get lost. Thus, while nodes cannot fail once the algorithm has started, communication can fail
with a certain probability δ. Without loss of generality, 1/log n < δ < 1/8: larger values of δ require
only O(1/log(1/δ)) repeated calls to bring the failure probability below 1/8, and smaller values only make
it easier to prove our claims.
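
The repeated-call claim can be checked with a short calculation. This is only a sketch under an assumption not spelled out in the text, namely that losses of successive calls are independent and each call is lost with probability δ; the symbol r (number of repetitions) is introduced here for illustration.

```latex
\Pr(\text{all } r \text{ repeated calls are lost}) = \delta^{r} \le \frac{1}{8}
\quad\Longleftrightarrow\quad
r \ge \frac{\log 8}{\log(1/\delta)} = \frac{3}{\log_2(1/\delta)} .
```

Under that assumption, r = O(1/log(1/δ)) repetitions suffice; for instance, δ = 1/2 needs r = 3.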

Throughout the paper, "with high probability (whp)" means "with probability at least 1 − 1/n^α, for some
α > 0".
3 DRR-Gossip Algorithms

3.1 Phase I: Distributed Random Ranking (DRR) 

The DRR algorithm is as follows (cf. Algorithm 1). Every node i ∈ V chooses a rank independently and
uniformly at random from [0, 1]. (Equivalently, each node can choose a rank uniformly at random from [1, n^3],
which leads to the same asymptotic bounds; however, choosing from [0, 1] leads to a smoother analysis, e.g.,
it allows the use of integrals.) Each node i then samples up to log n − 1 random nodes sequentially (one in each
round) until it finds a node of higher rank to connect to. If none of the log n − 1 sampled nodes has a higher
rank, then node i becomes a "root". Since every node except root nodes connects to a node with higher
rank, there is no cycle in the graph. Thus this process results in a collection of disjoint trees which together
constitute a forest F.
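
The procedure just described can be simulated in a few lines. The following is a minimal sketch, not the authors' implementation: it assumes logarithms are base 2, treats all nodes sequentially instead of in synchronized rounds (ranks are fixed before any probing, so the resulting forest has the same distribution), and the names `drr_forest` and `tree_sizes` are hypothetical.

```python
import math
import random

def drr_forest(n, seed=0):
    """Simulate DRR on the complete graph; return (rank, parent) lists.

    parent[i] is None when node i is a root.
    """
    rng = random.Random(seed)
    rank = [rng.random() for _ in range(n)]
    max_probes = max(1, int(math.log2(n)) - 1)  # assumption: log base 2
    parent = [None] * n
    for i in range(n):
        for _ in range(max_probes):
            u = rng.randrange(n)        # uniform sample from V
            if rank[u] > rank[i]:       # first higher-ranked node found
                parent[i] = u
                break
    return rank, parent

def tree_sizes(parent):
    """Return {root: number of nodes in the tree rooted there}."""
    def root_of(i):
        # Ranks strictly increase along parent pointers, so this terminates.
        while parent[i] is not None:
            i = parent[i]
        return i
    sizes = {}
    for i in range(len(parent)):
        r = root_of(i)
        sizes[r] = sizes.get(r, 0) + 1
    return sizes

rank, parent = drr_forest(1024)
sizes = tree_sizes(parent)
```

Inspecting `len(sizes)` and `max(sizes.values())` for growing n gives an empirical feel for the number and size of the trees analyzed next.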

In the following two theorems, we show the upper bounds of the number of trees and the size of each tree 
produced by the DRR algorithm; these are critical in bounding the time complexity of DRR-gossip. 

Theorem 2 (Number of Trees) The number of trees produced by the DRR algorithm is O(n/log n) whp.

Proof: Assume that ranks have already been assigned to the nodes. All ranks are distinct with
probability 1. Number the nodes according to the order statistic of their ranks: the ith node is the node with the
ith smallest rank. Let the indicator random variable $X_i$ take the value 1 if the ith smallest node is a root
and 0 otherwise. Let $X = \sum_{i=1}^{n} X_i$ be the total number of roots. The ith smallest node becomes a root if all
the nodes that it samples have rank smaller than or equal to itself, i.e., $\Pr(X_i = 1) = (i/n)^{\log n - 1}$. Hence, by
linearity of expectation, the expected number of roots (and thus, trees) is:

$$E[X] = \sum_{i=1}^{n} \Pr(X_i = 1) = \sum_{i=1}^{n} \left(\frac{i}{n}\right)^{\log n - 1} = \Theta\left(\int_{1}^{n} \left(\frac{i}{n}\right)^{\log n - 1} di\right) = \Theta\left(\frac{n}{\log n}\right).$$

Note that the $X_i$ are independent (but not identically distributed) random variables, since the probability that
the ith smallest ranked node becomes a root depends only on the log n − 1 random nodes that it samples
and is independent of the samples of the rest of the nodes. Thus, applying a Chernoff bound [14], we have:
$\Pr(X > 6E[X]) \le 2^{-6E[X]} = o(1/n)$. □
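
The expectation in the proof of Theorem 2 can be evaluated numerically. This is an illustrative sketch only: it assumes logarithms base 2, and `expected_roots` is a hypothetical helper that sums the root probabilities (i/n)^(log n − 1) from the proof and compares the total with n/log n.

```python
import math

def expected_roots(n):
    """Sum of (i/n)^(log n - 1) over i = 1..n, the expected number of roots."""
    d = math.log2(n) - 1  # assumption: logarithm base 2
    return sum((i / n) ** d for i in range(1, n + 1))

# Ratio of the exact sum to n / log n for a few network sizes.
ratios = {n: expected_roots(n) / (n / math.log2(n)) for n in (2**10, 2**14, 2**18)}
```

The ratios stay close to a constant, which is what the Θ(n/log n) estimate predicts.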

Theorem 3 (Size of a tree) The number of nodes in every tree produced by the DRR algorithm is at most
O(log n) whp.

Proof: We bound the probability that a tree of size Ω(log n) is produced by the DRR algorithm. Fix
a set S of $k = c \log n$ nodes, for some sufficiently large positive constant c. We first compute the probability
that this set of k nodes forms a tree. For the sake of analysis, we direct tree edges as follows: a tree edge
(i, j) is directed from node i to node j if rank(i) < rank(j), i.e., i connects to j. Without loss of generality,
fix a permutation of S: $(s_1, \ldots, s_\alpha, \ldots, s_\beta, \ldots, s_k)$ where $rank(s_\alpha) > rank(s_\beta)$, $1 \le \alpha < \beta \le k$.
This permutation induces a directed spanning tree on S in the following sense: $s_1$ is the root and any other
node $s_\alpha$ ($1 < \alpha \le k$) connects to a node in the totally (strictly) ordered set $\{s_1, \ldots, s_{\alpha-1}\}$ (as fixed by
the above permutation). For convenience, we denote the event that a node s connects to any node on a
directed tree T as $s \to T$. Note that $s \to T$ implies that s's rank is less than that of any node on the
tree T. Also, we denote the event of a directed spanning tree being induced on the totally (strictly) ordered
set $\{s_1, s_2, \ldots, s_\alpha, \ldots, s_h\}$ as $T_h$, where a node $s_\alpha$ can only connect to its preceding nodes in the ordered
set. As a special case, $T_1$ is the event of the induced directed tree containing only the root node $s_1$. We are
interested in the event $T_k$, i.e., the set S of k nodes forming a directed spanning tree in the above fashion. In
the following, we bound the probability of the event $T_k$ happening:

$$\Pr(T_k) = \Pr\left(T_1 \cap (s_2 \to T_1) \cap (s_3 \to T_2) \cap \cdots \cap (s_k \to T_{k-1})\right)$$

$$= \Pr(T_1) \Pr(s_2 \to T_1 \mid T_1) \Pr(s_3 \to T_2 \mid T_2) \cdots \Pr(s_k \to T_{k-1} \mid T_{k-1}). \qquad (2)$$
To bound each of the terms in the product, we use the principle of deferred decisions: when a new node
is sampled (i.e., for the first time) we assign it a random rank. For simplicity, we assume that each node
sampled is a new node; this does not change the asymptotic bound, since there are only k = O(log n)
nodes under consideration and each node samples at most O(log n) nodes. This assumption allows us to
use the principle of deferred decisions to assign random ranks without worrying about sampling an already
sampled node. Below we bound the conditional probability $\Pr(s_\alpha \to T_{\alpha-1} \mid T_{\alpha-1})$, for any $2 \le \alpha \le k$, as
follows. Let $r_q = rank(s_q)$ be the rank of node $s_q$, $1 \le q \le \alpha$; then



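$$\Pr(s_\alpha \to T_{\alpha-1} \mid T_{\alpha-1}) \le \int_0^1 \int_0^{r_1} \cdots \int_0^{r_{\alpha-1}} \sum_{h=0}^{\log n} \left(\frac{\alpha-1}{n}\right) r_\alpha^{h} \, dr_\alpha \cdots dr_1.$$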



The explanation for the above bound is as follows. Since $T_{\alpha-1}$ is a directed spanning tree on the first α − 1
nodes, and $s_\alpha$ connects to $T_{\alpha-1}$, we have $r_1 > r_2 > \cdots > r_{\alpha-1} > r_\alpha$. Hence $r_1$ can take any value between
0 and 1, $r_2$ can take any value between 0 and $r_1$, and so on. This is captured by the respective ranges of the
integrals. The term inside the integrals is explained as follows. There are at most log n − 1 attempts for node
$s_\alpha$ to connect to any one of the first α − 1 nodes. Suppose it connects in the hth attempt. Then the first h − 1
attempts should sample nodes whose rank is less than $r_\alpha$, hence the power of $r_\alpha$ in the integrand (as mentioned
earlier, we assume that we do not sample an already sampled node; this does not change the bound asymptotically).
The term $(\alpha - 1)/n$ is the probability that $s_\alpha$ connects to any one of the first α − 1 nodes in the hth attempt.
Simplifying the right hand side, we have



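$$\frac{\alpha-1}{n} \sum_{h=0}^{\log n} \int_0^1 \int_0^{r_1} \cdots \int_0^{r_{\alpha-1}} r_\alpha^{h} \, dr_\alpha \cdots dr_1 = \frac{\alpha-1}{n} \left( \frac{0!}{\alpha!} + \frac{1!}{(\alpha+1)!} + \frac{2!}{(\alpha+2)!} + \cdots + \frac{(\log n)!}{(\log n + \alpha)!} \right).$$

The above expression is bounded by $b/n$, where $0 < b < 1$ if $\alpha > 2$ and $0 < b \le 1 - 1/(\log n + 2)$ if $\alpha = 2$
(in the latter case the sum telescopes).

Besides, $\Pr(T_1) = O(1/\log n)$ (cf. Theorem 2); hence, the probability in equation (2) is bounded by the product
of this term and k − 1 factors of the form $b/n$.

Using the above, together with a union bound over the choices of the ordered set S, the probability that a tree
of size $k = c \log n$ is produced by the DRR algorithm is at most $1/n^{\Omega(1)}$
if c is sufficiently large.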

Complexity of Phase I — the DRR algorithm 



Theorem 4 The message complexity of the DRR algorithm is O(n log log n) whp. The time complexity is
O(log n) rounds.

Proof: Let d = log n − 1. Fix a node i. Its rank is chosen uniformly at random from [0, 1]. The expected
number of nodes sampled before node i finds a higher ranked node (or else, all d nodes will be sampled) is
computed as follows. The probability that exactly k nodes will be sampled is $\Theta\left(\frac{1}{k(k+1)}\right)$, since the last node
sampled should be the highest ranked node and i should be the second highest ranked node (whp, all the nodes
sampled will be unique). Hence the expected number of nodes probed is $\sum_{k=1}^{d} O\left(k \cdot \frac{1}{k(k+1)}\right) = O(\log d)$.
Hence the expected number of messages exchanged by node i is O(log d). By linearity of expectation, the expected
total number of messages exchanged by all nodes is O(n log d) = O(n log log n).
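
The O(log d) estimate can be made explicit with a short worked sum. This is a sketch under two simplifying assumptions: the probability of exactly k samples is taken to be exactly 1/(k(k+1)) (the constant hidden in the Θ is set to 1), and the event that all d samples fail, which has probability 1/(d+1) and contributes less than 1 to the expectation, is left out. $H_m$ denotes the m-th harmonic number.

```latex
\sum_{k=1}^{d} k \cdot \frac{1}{k(k+1)}
  \;=\; \sum_{k=1}^{d} \frac{1}{k+1}
  \;=\; H_{d+1} - 1
  \;\le\; \ln(d+1)
  \;=\; O(\log d).
```

With d = log n − 1 this gives O(log log n) expected probes per node.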

To show concentration, we set up a Doob martingale as follows. Let X denote the random variable that
counts the total number of nodes sampled by all nodes. E[X] = O(n log d). Assume that ranks have already
Algorithm 2: covmax = convergecast-max(F, v)
Input: the ranking forest F, and the value vector v over all nodes in F
Output: the local Max aggregate vector covmax over roots
foreach leaf node do send its value to its parent;
foreach intermediate node do
    - collect values from its children;
    - compare collected values with its own value;
    - update its value to the maximum among all and send the maximum to its parent.
end
foreach root node z do
    - collect values from its children;
    - compare collected values with its own value;
    - update its value to the local maximum value covmax(z).
end



been assigned to the nodes. Number the nodes according to the order statistic of their ranks: the ith node
is the node with the ith smallest rank. Let the indicator r.v. $Z_{ik}$ ($1 \le i \le n$, $1 \le k \le d$) indicate whether
the kth sample by the ith smallest ranked node succeeded or not (i.e., it found a higher ranked node). If
it succeeded then $Z_{ij} = 1$ for all $j < k$ and $Z_{ij} = 0$ for all $j > k$. Thus $X = \sum_{i=1}^{n} \sum_{k=1}^{d} Z_{ik}$. Then
the sequence $X_0 = E[X], X_1 = E[X \mid Z_{11}], \ldots, X_{nd} = E[X \mid Z_{11}, \ldots, Z_{nd}]$ is a Doob martingale. Note
that $|X_\ell - X_{\ell-1}| \le d$ ($1 \le \ell \le nd$) because fixing the outcome of a sample of one node affects only the
outcomes of other samples made by the same node and not the samples made by other nodes. Applying
Azuma's inequality, for a positive constant ε we have:

$$\Pr(|X - E[X]| \ge \epsilon n) \le 2 \exp\left(-\frac{\epsilon^2 n^2}{2 n d^3}\right) = o(1/n).$$

The time complexity is immediate since each node probes at most O(log n) nodes in as many rounds. ■
3.2 Phase II: Convergecast and Broadcast 

In the second phase of our algorithm, the local aggregate of each tree is obtained at the root by the
Convergecast algorithm, an aggregation process starting from leaf nodes and proceeding upward along the tree
to the root node. For example, to compute the local max/min, all leaf nodes simply send their values to
their parent nodes. An intermediate node collects the values from its children, compares them with its own
value and sends its parent node the max/min value among all received values and its own. A root node then
can obtain the local max/min value of its tree. Algorithm 2 and Algorithm 3 give the pseudocode of the
Convergecast-max algorithm and the Convergecast-sum algorithm, respectively.
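
The upward aggregation can be sketched as a single pass over a forest given by parent pointers. This is an illustrative, centralized simulation rather than the distributed protocol itself: the function name `convergecast`, the parent-list representation, and the example forest are assumptions made for the sketch. It computes the local maximum together with the (sum, size) pair used by Convergecast-sum.

```python
def convergecast(parent, values):
    """Aggregate values up each tree; return {root: (max, sum, count)}.

    parent[i] is the parent of node i, or None if i is a root.
    """
    n = len(parent)
    children = [[] for _ in range(n)]
    roots = []
    for i, p in enumerate(parent):
        if p is None:
            roots.append(i)
        else:
            children[p].append(i)
    # Breadth-first order from the roots; reversed, children precede parents.
    order = list(roots)
    for i in order:              # the list grows while it is being traversed
        order.extend(children[i])
    agg = [(v, v, 1) for v in values]   # per node: (max, sum, size count)
    for i in reversed(order):
        p = parent[i]
        if p is not None:        # "send" the partial aggregate to the parent
            m, s, w = agg[p]
            cm, cs, cw = agg[i]
            agg[p] = (max(m, cm), s + cs, w + cw)
    return {r: agg[r] for r in roots}

# Hypothetical forest: tree {0, 1, 2, 3} rooted at 0 and tree {4, 5} rooted at 4.
parent = [None, 0, 0, 1, None, 4]
values = [3, 7, 1, 9, 4, 2]
local = convergecast(parent, values)
```

Each non-root node contributes exactly one "message" in this pass, matching the one-message-per-node upward aggregation.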

After the Convergecast process, each root broadcasts its address to all other nodes in its tree via the tree
links. This process proceeds from the root down to the leaves via the tree links (these two-way links were
already established during Phase I). At the end of this process, all non-root nodes know the identity (address)
of their respective roots.
Complexity of Phase II 

Every node except the root nodes needs to send a message to its parent in the upward aggregation process
of the Convergecast algorithms. So the message complexity is O(n). Since each node can communicate with
at most one node in one round, the time complexity is bounded by the size of the tree. (This is the reason for
bounding size and not just the height.) Since the tree size (hence, tree height also) is bounded by O(log n)
(cf. Theorem 3), the time complexity of Convergecast and Broadcast is O(log n). Moreover, as the number
of roots is at most O(n/log n) by Theorem 2, the message complexity for broadcast is also O(n).
Algorithm 3: covsum = convergecast-sum(F, v)
Input: the ranking forest F and the value vector v over all nodes in F
Output: the local Ave aggregate vector covsum over roots.
Initialization: every node i stores a row vector $(v_i, w_i = 1)$ including its value $v_i$ and a size count $w_i$;
foreach leaf node i ∈ F do
    - send its parent a message containing the vector $(v_i, w_i = 1)$;
    - reset $(v_i, w_i) = (0, 0)$.
end
foreach intermediate node j ∈ F do
    - collect messages (vectors) from its children;
    - compute and update $v_j = v_j + \sum_{k \in Child(j)} v_k$, and $w_j = w_j + \sum_{k \in Child(j)} w_k$, where
      Child(j) = {j's children nodes};
    - send computed $(v_j, w_j)$ to its parent;
    - reset its vector $(v_j, w_j) = (0, 0)$ when its parent successfully receives its message.
end
foreach root node z ∈ $\tilde{V}$ do
    - collect messages (vectors) from its children;
    - compute the local sum aggregate $covsum(z, 1) = v_z + \sum_{k \in Child(z)} v_k$, and the size count of the
      tree $covsum(z, 2) = w_z + \sum_{k \in Child(z)} w_k$, where Child(z) = {z's children nodes}.
end



3.3 Phase III: Gossip 

In the third phase, all roots of the trees compute the global aggregate by performing the uniform gossip
algorithm on the graph $\tilde{G} = clique(\tilde{V})$, where $\tilde{V} \subseteq V$ is the set of roots and $|\tilde{V}| = m = O(n/\log n)$.

The idea of uniform gossip is as follows. Every root independently and uniformly at random selects a
node to send its message. If the selected node is another root then the task is completed. If not, the selected
node needs to forward the received message to its root (all nodes in a tree know the root's address at the end
of Phase II; this is where we use non-address oblivious communication). Thus, to traverse an
edge of $\tilde{G}$, a message needs at most two hops of G.
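
The forwarding idea can be illustrated with a small simulation of the gossip procedure for Max. This is a sketch under simplifying assumptions (lossless links, a centralized round loop, and the hypothetical names `gossip_max`, `root_of`, and `local_max`); it does not include the sampling procedure that follows the gossip rounds in the full algorithm.

```python
import random

def gossip_max(root_of, local_max, rounds, seed=0):
    """Gossip rounds among roots, with non-root nodes relaying to their root.

    root_of[j]  : root of the tree containing node j (root_of[r] == r for roots)
    local_max   : {root: local Max obtained by convergecast}
    Returns {root: current estimate of the global Max}.
    """
    rng = random.Random(seed)
    n = len(root_of)
    x = dict(local_max)
    for _ in range(rounds):
        inbox = {r: [] for r in x}
        for r in x:
            j = rng.randrange(n)             # root r calls a uniform node of V
            inbox[root_of[j]].append(x[r])   # j relays to its root (at most 2 hops)
        # Synchronous update: every root keeps the largest value seen so far.
        x = {r: max([x[r]] + inbox[r]) for r in x}
    return x

# Hypothetical forest on nodes 0..5 with trees {0, 1, 2}, {3, 4}, {5}.
root_of = [0, 0, 0, 3, 3, 5]
local_max = {0: 7, 3: 9, 5: 2}
est = gossip_max(root_of, local_max, rounds=20)
```

Because a target is drawn uniformly from all of V, a root is reached with probability proportional to its tree size, which is the non-uniformity discussed in the performance analysis of Gossip-max.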

Algorithm 4, Gossip-max, and Algorithm 6, Gossip-ave (which is a modification of the Push-Sum
algorithm of [8, 9]), compute the Max and Ave aggregates respectively (other aggregates such as Min, Sum
etc., can be calculated by a suitable modification). Note that, unlike Gossip-max, the Gossip-ave algorithm does
not need a sampling procedure.
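
A push-sum style averaging step among the roots can be sketched as follows. This is an illustration under assumptions (lossless links, centralized simulation of synchronized rounds, hypothetical names `gossip_ave` and `cov_sum`), not the paper's exact procedure: each root keeps half of its (sum, size) pair and pushes the other half to the root of a uniformly chosen node.

```python
import math
import random

def gossip_ave(root_of, cov_sum, rounds, seed=0):
    """Push-sum among roots. cov_sum: {root: (local_sum, tree_size)}.

    Returns (estimates, s, g) where estimates[r] = s[r] / g[r].
    """
    rng = random.Random(seed)
    n = len(root_of)
    s = {r: float(v[0]) for r, v in cov_sum.items()}
    g = {r: float(v[1]) for r, v in cov_sum.items()}
    for _ in range(rounds):
        new_s = {r: s[r] / 2 for r in s}     # each root keeps half of its pair
        new_g = {r: g[r] / 2 for r in g}
        for r in s:
            target = root_of[rng.randrange(n)]  # uniform node, relayed to its root
            new_s[target] += s[r] / 2
            new_g[target] += g[r] / 2
        s, g = new_s, new_g
    return {r: s[r] / g[r] for r in s}, s, g

# Hypothetical forest: trees {0, 1, 2}, {3, 4}, {5}; the global average is 24 / 6 = 4.
root_of = [0, 0, 0, 3, 3, 5]
cov_sum = {0: (12.0, 3), 3: (10.0, 2), 5: (2.0, 1)}
estimates, s, g = gossip_ave(root_of, cov_sum, rounds=30)
```

The total sum and total size are conserved in every round, so each root's ratio s/g remains a weighted average of the initial local averages as the estimates mix.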

Algorithm 5, Data-spread, a modification of Gossip-max, can be used by a root node to spread its value.
If a root needs to spread a particular value over the network, it sets this value as its initial value and all other
roots set their initial value to minus infinity.

3.3.1 Performance of Gossip-max and Data-spread Algorithms 

Let m denote the number of root nodes. By Theorem 2, we have $m = |\tilde{V}| = O(n/\log n)$ where n = |V|.
Karp et al. [7] show that all m nodes of a complete graph can know a particular rumor (e.g., the Max in our
application) in O(log m) = O(log n) rounds with high probability by using their Push algorithm (a prototype
of our Gossip-max algorithm) with uniform selection probability. Similar to the Push algorithm, Gossip-max
needs O(m log m) = O(n) messages for all roots to obtain Max if the selection probability is uniform, i.e.,
1/m. However, in the implementation of the Gossip-max algorithm on the forest, the root of a tree is selected
with a probability proportional to its size (number of nodes in the tree). Hence, the selection probability is
not uniform. In this case, we can only guarantee that after the gossip procedure of the Gossip-max algorithm,
a portion of the roots including the root of the largest tree will possess the Max. After the gossip procedure,
Algorithm 4: $\hat{x}_{max}$ = Gossip-max($G$, $F$, $\tilde{V}$, $y$)

Initialization: every root $i \in \tilde{V}$ is of the initial value $x_{0,i} = y(i)$ from the input $y$.
/* To compute Max, $x_{0,i} = y(i) = covmax(i)$; to compute Ave, $x_{0,i} = y(i) = covsum(i, 2)$. */

Gossip procedure:
for $t = 1 : O(\log n)$ rounds do
    Every root $i \in \tilde{V}$ independently and uniformly at random selects a node in $V$ and sends the
    selected node a message containing its current value $x_{t-1,i}$.
    Every node $j \in V - \tilde{V}$ forwards any received messages to its root.
    Every root $i \in \tilde{V}$
      - collects messages and compares the received values with its own value;
      - updates its current value $x_{t,i}$, which is also $\hat{x}_{max,t}(i)$, node $i$'s current estimate of Max,
        to the maximum among all received values and its own.
end

Sampling procedure:
for $t = 1 : O(\log n)$ rounds do
    Every root $i \in \tilde{V}$ independently and uniformly at random selects a node in $V$ and sends each of
    the selected nodes an inquiry message.
    Every node $j \in V - \tilde{V}$ forwards any received inquiry messages to its root.
    Every root $i \in \tilde{V}$, upon receiving inquiry messages, sends the inquiring roots its value.
    Every root $i \in \tilde{V}$ updates $x_{t,i}$, i.e., $\hat{x}_{max,t}(i)$, to the maximum value it inquires.
end
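The two procedures of Gossip-max can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions that are not part of the paper: message failures are ignored, forwarding through non-root nodes is modeled by choosing a root with probability proportional to its tree size, and the function name `gossip_max` and its parameters are hypothetical.

```python
import random

def gossip_max(tree_sizes, values, gossip_rounds, sample_rounds, rng):
    """Simulate Gossip-max among roots (no message failures in this sketch)."""
    roots = list(range(len(tree_sizes)))
    x = list(values)

    def pick_root():
        # Choosing a uniform node of V and forwarding to its root is the same
        # as choosing root r with probability tree_sizes[r] / n.
        return rng.choices(roots, weights=tree_sizes, k=1)[0]

    # Gossip procedure (push): every root sends its current value once per round.
    for _ in range(gossip_rounds):
        inbox = [[] for _ in roots]
        for i in roots:
            inbox[pick_root()].append(x[i])
        x = [max([x[i]] + inbox[i]) for i in roots]

    # Sampling procedure (pull): every root inquires one random root per round.
    for _ in range(sample_rounds):
        replies = [x[pick_root()] for _ in roots]
        x = [max(x[i], replies[i]) for i in roots]
    return x

if __name__ == "__main__":
    sizes = [5, 1, 3, 2, 9, 4]          # hypothetical tree sizes
    local_max = [3, 7, -1, 0, 2, 5]     # hypothetical local maxima at the roots
    print(gossip_max(sizes, local_max, 20, 20, random.Random(0)))
```

Data-spread fits the same sketch: the spreading root passes its value and every other root passes `float("-inf")`. The estimates only ever increase and never exceed the true maximum, which is the invariant the analysis relies on.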



Algorithm 5: $\hat{x}_{ru}$ = Data-spread($G$, $F$, $\tilde{V}$, $x_{ru}$)

Initialization: A root node $i \in \tilde{V}$ which intends to spread its value $x_{ru}$, $|x_{ru}| < \infty$, sets $x_{0,i} = x_{ru}$.
All the other nodes $j$ set $x_{0,j} = -\infty$.
Run Gossip-max($G$, $F$, $\tilde{V}$, $x_0$) on the initialized values.



Algorithm 6: $\hat{x}_{ave}$ = Gossip-ave($G$, $F$, $\tilde{V}$, $covsum$)

Initialization: Every root $i \in \tilde{V}$ sets a vector $(s_{0,i}, g_{0,i}) = covsum(i)$, where $s_{0,i}$ and $g_{0,i}$ are the
local sum of values and the size of the tree rooted at $i$, respectively.
for $t = 1 : O(\log m + \log(1/\varepsilon))$ rounds do
    Every root node $i \in \tilde{V}$ independently and uniformly at random selects a node in $V$ and sends the
    selected node a message containing a row vector $(s_{t-1,i}/2,\ g_{t-1,i}/2)$.
    Every node $j \in V - \tilde{V}$ forwards any received messages to the root of its ranking tree.
    Let $A_{t,i} \subseteq \tilde{V}$ be the set of roots whose messages reach root node $i$ at round $t$. Every root node
    $i \in \tilde{V}$ updates its row vector by
        $s_{t,i} = s_{t-1,i}/2 + \sum_{j \in A_{t,i}} s_{t-1,j}/2$,
        $g_{t,i} = g_{t-1,i}/2 + \sum_{j \in A_{t,i}} g_{t-1,j}/2$.
    Every root node $i \in \tilde{V}$ updates its estimate of the global average by $\hat{x}_{ave,t}(i) = \hat{x}_{ave,t,i} = s_{t,i}/g_{t,i}$.
end
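The update rule of Gossip-ave is a push-sum step on the pair (local sum, tree size). The following sketch simulates it with exact rational arithmetic so that mass conservation can be checked exactly. It assumes no message failures and models forwarding by selecting a root with probability proportional to its tree size; the function name `gossip_ave` and its parameters are hypothetical.

```python
import random
from fractions import Fraction

def gossip_ave(local_sums, tree_sizes, rounds, rng):
    """Push-sum among roots on (s, g); returns the final s and g vectors."""
    roots = list(range(len(tree_sizes)))
    s = [Fraction(v) for v in local_sums]
    g = [Fraction(v) for v in tree_sizes]
    for _ in range(rounds):
        # Every root keeps half of its pair ...
        new_s = [v / 2 for v in s]
        new_g = [v / 2 for v in g]
        # ... and pushes the other half to the root of a random node.
        for i in roots:
            j = rng.choices(roots, weights=tree_sizes, k=1)[0]
            new_s[j] += s[i] / 2
            new_g[j] += g[i] / 2
        s, g = new_s, new_g
    return s, g

if __name__ == "__main__":
    sums, sizes = [10, 4, 6], [5, 1, 2]   # hypothetical local sums and tree sizes
    s, g = gossip_ave(sums, sizes, 30, random.Random(1))
    z = sizes.index(max(sizes))           # root of the largest tree
    print(float(s[z] / g[z]), "vs true average", sum(sums) / sum(sizes))
```

Because each root keeps one half and sends exactly one half, the totals of `s` and `g` never change; only their distribution over the roots does, which is why the ratio at a well-mixed root approaches the global average.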



roots can sample $O(\log n)$ other roots to confirm and update, if necessary, their values and reach
consensus on the global maximum, Max.

We show the following theorem for Gossip-max.

Theorem 5 After the gossip procedure of the Gossip-max algorithm, at least $\Omega(\frac{cn}{\log n})$ root nodes obtain the
global maximum, Max, whp, where $n = |V|$ and $0 < c < 1$ is a constant.

Proof: As per our failure model, a message may fail to reach the selected root node with probability $\rho$
(which is at most $2\delta$, since failure may occur either during the initial call to a non-root node or during the
forwarding call from the non-root node to the root of its tree). For convenience, we call the roots that
know the Max value (the global maximum) the max-roots and those that do not the non-max-roots.

Let $R_t$ be the number of max-roots in round $t$. Our proof is in two steps. We first show that, whp,
$R_t \ge 4\log n$ after $8\log n/(1-\rho)$ rounds of Gossip-max. If $R_0 \ge 4\log n$, then the task is completed.
Consider the case when $R_0 < 4\log n$. Since the initial number of max-roots is small in this case, the chance
that a max-root selects another max-root is small. Similarly, the chance that two or more max-roots select
the same root is also small. So, in this step, whp a max-root will select a non-max-root to send out its gossip
message. If the gossip message successfully reaches the selected non-max-root, $R_t$ will increase by 1.
Let $X_i$ denote the indicator of the event that a gossip message $i$ from some max-root successfully reaches
the selected non-max-root. We have $\Pr(X_i = 1) = 1 - \rho$. Then $X = \sum_{i=1}^{8\log n/(1-\rho)} X_i$ is the minimal
number of max-roots after $8\log n/(1-\rho)$ rounds. Clearly, $E[X] = 8\log n$. Here we conservatively assume
the worst situation, in which initially there is only one max-root and at each round only one max-root selects a
non-max-root.

Applying Azuma's inequality [14] and setting $\varepsilon = 1/2$:

$$\Pr(|X - E[X]| > \varepsilon E[X]) < 2\exp\left(-\frac{\varepsilon^2 E[X]^2}{16\log n}\right) = 2\exp(-\log n) = 2\cdot\frac{1}{n}.$$

Hence, with probability at least $1 - \frac{2}{n}$, after $8\log n/(1-\rho) = O(\log n)$ rounds, $R_t \ge 4\log n$.

In the second step of our proof, we lower bound the rate at which $R_t$ increases when $R_t \ge 4\log n$.
In each round, there are $R_t$ messages sent out from max-roots. Let $Y_i$ denote the indicator of the event
that such a message $i$ from a max-root successfully reaches a non-max-root. Then $Y_i = 0$ when one of
the following events happens. (1) The message $i$ fails in routing to its destination, with probability $\rho$. (2) The
message $i$ is destined to another max-root although it successfully travels over the network with probability
$(1-\rho)$. The probability of this event is at most $\frac{(1-\rho)R_t\log n}{n}$, since whp the size of a ranking tree is $O(\log n)$
(cf. Theorem 3). (3) The message $i$ and at least one other message are destined to the same non-max-root.
As the probability that three or more messages are destined to the same node is very small, we only consider the case
that two messages select the same non-max-root. We also conservatively exclude both messages from
contributing to the increase of $R_t$. This event happens with probability at most $\frac{(1-\rho)R_t\log n}{n}$.

Applying the union bound [14],

$$\Pr(Y_i = 0) < \rho + \frac{2(1-\rho)R_t\log n}{n}.$$

Since $R_t < \frac{cn}{\log n}$ for any constant $0 < c < 1$ (otherwise, the task is completed),

$$\Pr(Y_i = 0) < \rho + 2c(1-\rho) = c' + (1-c')\rho,$$

where $c' = 2c < 1$ is a constant that is suitably fixed so that $c' + (1-c')\rho < 1$. Consequently, we have
$\Pr(Y_i = 1) > (1-c')(1-\rho)$, and $E[Y] = \sum_{i=1}^{R_t} E[Y_i] > (1-c')(1-\rho)R_t$. Applying Azuma's inequality,

$$\Pr(|Y - E[Y]| > \varepsilon E[Y]) < 2\exp\left(-\frac{\varepsilon^2 E[Y]^2}{2R_t}\right) < 2\exp\left(-\frac{\varepsilon^2(1-c')^2(1-\rho)^2R_t^2}{2R_t}\right).$$
Since in this step, whp $R_t \ge 4\log n$, and $(1-c')^2(1-\rho)^2 > 0$, setting $\varepsilon = \frac{1}{2}$ and $\alpha = O(1)$, we obtain

$$\Pr\left(Y < \tfrac{1}{2}(1-c')(1-\rho)R_t\right) < 2\cdot n^{-\alpha}.$$

Thus, whp, $R_{t+1} > R_t + \frac{1}{2}(1-c')(1-\rho)R_t = \beta R_t$, where $\beta = 1 + \frac{1}{2}(1-c')(1-\rho) > 1$. Therefore, whp,
after $(8\log n/(1-\rho) + \log_\beta n) = O(\log n)$ rounds, at least $\Omega(\frac{cn}{\log n})$ roots will have the Max. ■
Sampling Procedure 

From Theorem 5, after the gossip procedure, there are $\Omega(\frac{cn}{\log n}) = \Omega(cm)$, $0 < c < 1$, nodes with the
Max value. For roots to reach consensus on Max, they sample each other as in the sampling procedure. It is
possible that the root of a larger tree will be sampled more frequently than the roots of smaller trees. However,
this non-uniformity is an advantage, since the roots of larger trees would have obtained Max (in the gossip
procedure) with higher probability due to this same non-uniformity. Hence, in the sampling procedure, a
root without Max can obtain Max with higher probability by this non-uniform sampling. Thus, we have the
following theorem.

Theorem 6 After the sampling procedure of Gossip-max algorithm, all roots know the Max value, whp. 
Proof: After the sampling procedure, the probability that none of the roots possessing the Max is 

sampled by a root not knowing the Max is at most { ^^^ ) = < i. Thus, after the sampling procedure, 
with probability at least 1 — ^, all the roots will know the Max. ■ 
Complexity of Gossip-max and Data-spread algorithms 

The gossip procedure takes $O(\log n)$ rounds and $O(m\log n) = O(\frac{n}{\log n}\log n) = O(n)$ messages. The sampling
procedure takes $O(\log n)$ rounds and $O(m\log n) = O(n)$ messages. To sum up, this phase
takes $O(\log n)$ rounds and $O(n)$ messages in total for all the roots in the network to reach consensus on Max.
The complexity of the Data-spread algorithm is the same as that of the Gossip-max algorithm.



3.3.2 Performance of Gossip-ave Algorithm 

When the uniformity assumption holds in gossip (i.e., in each round, nodes are selected uniformly at random),
it has been shown in [9] that on an $m$-clique, with probability at least $1 - \delta'$, Gossip-ave (uniform push-sum
in [9]) needs $O(\log m + \log\frac{1}{\varepsilon} + \log\frac{1}{\delta'})$ rounds and $O(m(\log m + \log\frac{1}{\varepsilon} + \log\frac{1}{\delta'}))$ messages for all $m$ nodes
to reach consensus on the global average within a relative error of at most $\varepsilon$. When uniformity does not
hold, the performance of uniform gossip will depend on the distribution of selection probability. For the efficient
gossip algorithm [8], it is shown that the node being selected with the largest probability will have the global
average, Ave, in $O(\log m + \log\frac{1}{\varepsilon})$ rounds. Here, we prove that the same upper bound holds for our Gossip-ave
algorithm, namely, the root of the largest tree will have Ave after $O(\log m + \log\frac{1}{\varepsilon})$ rounds of the gossip
procedure of the Gossip-ave algorithm. In this bound, $m = O(n/\log n)$ is the number of roots (obtained from
the DRR algorithm) and the relative error is $\varepsilon = n^{-\alpha}$, $\alpha > 0$.

Theorem 7 Whp, there exists a time $T_{ave} = O(\log m + \alpha\log n) = O(\log n)$, $\alpha > 0$, such that for all time
$t \ge T_{ave}$, the relative error of the estimate of the average aggregate on the root of the largest ranking tree, $z$, is
at most $\frac{2}{n^{\alpha}-1}$, where the relative error is $\frac{|\hat{x}_{ave,t,z} - x_{ave}|}{|x_{ave}|}$, and the average aggregate, Ave, is $x_{ave} = \frac{\sum_j x_j}{\sum_j g_j}$.

We recall that the gossip-ave algorithm works on the graph $\tilde{G} = clique(\tilde{V})$, where $\tilde{V} \subseteq V$ is the set
of roots and $|\tilde{V}| = m = O(n/\log n)$. To prove Theorem 7, we need some definitions as in [9]. We define
an $m$-tuple contribution vector $y_{t,i}$ such that $s_{t,i} = y_{t,i}\cdot x = \sum_j y_{t,i,j}x_j$ and $w_{t,i} = \|y_{t,i}\|_1 = \sum_j y_{t,i,j}$,
where $y_{t,i,j}$ is the $j$-th entry of $y_{t,i}$ and $x_j$ is the initial value at root node $j$, i.e., $x_j = covsum(j)$, the local
aggregate of the tree rooted at node $j$ computed by Convergecast-sum. $y_{0,i} = e_i$, the unit vector with the $i$-th
entry being 1. Therefore, $\sum_i y_{t,i,j} = 1$, and $\sum_i w_{t,i} = m$. When $y_{t,i}$ is close to $\frac{1}{m}\mathbf{1}$, where $\mathbf{1}$ is the vector
with all entries 1, the approximate of Ave, $\hat{x}_{ave,t,i}$, is close to the true average $x_{ave}$. Note that $w_{t,i}$,
which is different from $g_{t,i}$, is a dummy parameter borrowed from [9] to characterize the diffusion speed.
In our Gossip-ave algorithm, we set $g_{0,i}$ to be the size of the root $i$'s tree. The algorithm then computes
the estimate of average directly by $\hat{x}_{ave,t,i} = s_{t,i}/g_{t,i}$. If we set a dummy weight $w_{t,i}$, whose initial value is
$w_{0,i} = 1$, $\forall i \in \tilde{V}$, the algorithm performs in the same manner: every node works on a triplet $(s_{t,i}, g_{t,i}, w_{t,i})$
and computes $\hat{x}_{ave,t,i} = \frac{s_{t,i}/w_{t,i}}{g_{t,i}/w_{t,i}}$, where $s_{t,i}/w_{t,i}$ is the estimate of the average local sum on a root and $g_{t,i}/w_{t,i}$
is the estimate of the average size of a tree. Their relative errors are bounded in the same way as follows.

The relative error in the contributions (with respect to the diffusion effect of gossip) at node $i$ at time $t$ is
$\Delta_{t,i} = \max_j\left|\frac{y_{t,i,j}}{\|y_{t,i}\|_1} - \frac{1}{m}\right| = \left\|\frac{y_{t,i}}{\|y_{t,i}\|_1} - \frac{1}{m}\cdot\mathbf{1}\right\|_\infty$. The following potential function

$$\Phi_t = \sum_{i,j}\left(y_{t,i,j} - \frac{w_{t,i}}{m}\right)^2$$

is the sum of the variance of the contributions $y_{t,i,j}$. We name the root of the largest tree as node $z$.
To prove Theorem 7, we need some auxiliary lemmas.

Lemma 8 (Geometric convergence of $\Phi$) The conditional expectation

$$E[\Phi_{t+1} \mid \Phi_t = \phi] = \frac{1}{2}\Big(1 - \sum_{i\in\tilde{V}} P_i^2\Big)\phi < \frac{1}{2}\phi,$$

where $P_i = (1-\delta)\frac{g_i}{n}$ is the probability that the root node $i$ is selected by any other root node, $g_i$ is the size
of the tree rooted at node $i$, $\delta$ is the probability that a message fails to reach its destined root node, and $n$ is
the total number of nodes in the network.

Proof: This proof is generalized from [9]. The difference is that the selection probability, $P_i$, is not
uniform any more but depends on the tree size, $g_i$. $P_i$ is the probability that root $i$ is selected by any other
root and $\sum_{i\in\tilde{V}} P_i^2$ is the probability that two roots select the same root. The conditional expectation of
potential at round $t + 1$ is




i,3,k 



j,k k'^k i^y 
k,j,k' 

hj i^V k 
The last equality follows from the fact that



k k k 



Lemma 9 There exists a $\tau = O(\log m)$ such that after any $t > \tau$ rounds of Gossip-ave, $w_{t,z} \ge 2^{-\tau}$ at $z$, the
root of the largest tree.

Proof: In the case that the selection probability is uniform, it has been shown in [9] that on an $m$-clique,
with probability at least $1 - \frac{\delta'}{2}$, after $4\log m + \log\frac{2}{\delta'}$ rounds, a message originating from any node (through a
random walk on the clique) would have visited all nodes of the clique. When the distribution of the selection
probability is not uniform, it is clear that a message originating from any node must have visited the node
with the highest selection probability after a certain number of rounds that is greater than $4\log m + \log\frac{2}{\delta'}$
with probability at least $1 - \frac{\delta'}{2}$. ■
From the previous two lemmas, we derive the following theorem.

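Theorem 10 (Diffusion speed of Gossip-ave) With probability at least $1 - \delta'$, there exists a time $T_{ave} =
O(\log m + \log\frac{1}{\varepsilon} + \log\frac{1}{\delta'})$, such that $\forall t \ge T_{ave}$, the contributions at $z$, the root of the largest tree, are nearly
uniform, i.e.,

$$\max_j\left|\frac{y_{t,z,j}}{\|y_{t,z}\|_1} - \frac{1}{m}\right| = \left\|\frac{y_{t,z}}{\|y_{t,z}\|_1} - \frac{1}{m}\cdot\mathbf{1}\right\|_\infty \le \varepsilon.$$

Proof: By Lemma 8, we obtain that $E[\Phi_t] < (m-1)2^{-t} < m2^{-t}$, as $\Phi_0 = (m-1)$. By Lemma 9,
we set $\tau = 4\log m + \log\frac{2}{\delta'}$ and $\hat{\varepsilon} = \varepsilon^2\cdot\frac{\delta'}{2}\cdot 2^{-2\tau}$. Then after $t = \log m + \log\frac{1}{\hat{\varepsilon}}$ rounds of Gossip-ave,
$E[\Phi_t] < \hat{\varepsilon}$. By Markov's inequality [14], with probability at least $1 - \frac{\delta'}{2}$, the potential $\Phi_t \le \varepsilon^2\cdot 2^{-2\tau}$, which
guarantees that $\left|y_{t,i,j} - \frac{w_{t,i}}{m}\right| \le \varepsilon\cdot 2^{-\tau}$ for all the root nodes $i$.

To have $\max_j\left|\frac{y_{t,z,j}}{\|y_{t,z}\|_1} - \frac{1}{m}\right| \le \varepsilon$, we need to lower bound the weight of node $z$. From Lemma 9,
$w_{t,z} = \|y_{t,z}\|_1 \ge 2^{-\tau}$ with probability at least $1 - \frac{\delta'}{2}$ for $t > \tau$. Note that Lemma 9 only applies to $z$, the root of the
largest tree. (A root node of a relatively small tree may not be selected often enough to have such a lower
bound on its weight.) Using the union bound, we obtain, with probability at least $1 - \delta'$, $\max_j\left|\frac{y_{t,z,j}}{\|y_{t,z}\|_1} - \frac{1}{m}\right| \le \varepsilon$. ■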



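Now we are ready to prove Theorem 7.

Proof of Theorem 7: From Theorem 10, with probability at least $1 - \delta'$, it is guaranteed that after $T_{ave} = O(2\log m +
\log\frac{1}{\varepsilon} + \log\frac{1}{\delta'})$ rounds of Gossip-ave, at $z$, the root of the largest tree, $\left\|\frac{y_{t,z}}{\|y_{t,z}\|_1} - \frac{1}{m}\cdot\mathbf{1}\right\|_\infty \le \frac{\varepsilon}{m}$. Let both
$\varepsilon = n^{-\alpha}$ and $\delta' = n^{-\alpha}$, $\alpha > 0$; then $T_{ave} = O(2\log m + 2\alpha\log n) = O(\log n)$.
Using Hölder's inequality, we obtain

$$\frac{\left|\frac{s_{t,z}}{w_{t,z}} - \frac{1}{m}\sum_j x_j\right|}{\left|\frac{1}{m}\sum_j x_j\right|}
= m\cdot\frac{\left|\left(\frac{y_{t,z}}{\|y_{t,z}\|_1} - \frac{1}{m}\cdot\mathbf{1}\right)\cdot x\right|}{\left|\sum_j x_j\right|}
\le m\cdot\frac{\left\|\frac{y_{t,z}}{\|y_{t,z}\|_1} - \frac{1}{m}\cdot\mathbf{1}\right\|_\infty\cdot\|x\|_1}{\left|\sum_j x_j\right|}
\le \varepsilon\cdot\frac{\sum_j|x_j|}{\left|\sum_j x_j\right|}.$$

When all $x_j$ have the same sign, we have $\frac{\left|\frac{s_{t,z}}{w_{t,z}} - \frac{1}{m}\sum_j x_j\right|}{\left|\frac{1}{m}\sum_j x_j\right|} \le \varepsilon$. Further, we need to bound the relative error of
Ave. W.l.o.g., let the true average of the sum of values in a tree be positive, i.e., $s_{ave} = \frac{1}{m}\sum_j x_j > 0$, and
by definition, the true average of the size of a tree is also positive, i.e., $g_{ave} = \frac{1}{m}\sum_j g_j = \frac{n}{m} > 0$. Therefore,
the global average Ave is $x_{ave} = \frac{s_{ave}}{g_{ave}}$. Since $\left|\frac{s_{t,z}}{w_{t,z}} - s_{ave}\right| \le \varepsilon s_{ave}$ and $\left|\frac{g_{t,z}}{w_{t,z}} - g_{ave}\right| \le \varepsilon g_{ave}$, we obtain

$$\frac{1-\varepsilon}{1+\varepsilon}\cdot\frac{s_{ave}}{g_{ave}} \le \hat{x}_{ave,t,z} = \frac{s_{t,z}}{g_{t,z}} = \frac{s_{t,z}/w_{t,z}}{g_{t,z}/w_{t,z}} \le \frac{1+\varepsilon}{1-\varepsilon}\cdot\frac{s_{ave}}{g_{ave}}.$$

Set $\varepsilon' = c\varepsilon$, where $c = \frac{2}{1-\varepsilon} \ge 2$ is bounded when $\varepsilon < 1$. (For example, if $\varepsilon = 10^{-2}$, then $c \approx 2.02$ and
$\varepsilon' \approx 2.02\varepsilon$.) We set $\varepsilon = n^{-\alpha}$ and then $\varepsilon' = \frac{2}{n^{\alpha}-1} \approx 2\varepsilon$. Thus, with probability at least $1 - \delta'$, the relative
error at $z$ is

$$\frac{|\hat{x}_{ave,t,z} - x_{ave}|}{|x_{ave}|} \le \varepsilon'$$

after at most $O(\log m + 2\alpha\log n) = O(\log n)$ rounds of the Gossip-ave algorithm.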

The above assumption that all $x_j$ have the same sign is just for complexity analysis but not for the
execution of the gossip-ave algorithm. The gossip-ave algorithm works well without any assumption on the
values of roots. In the following, we further relax this assumption and show that the upper bound on the
running time is also valid when the $x_j$ are not all of the same sign.

Let $\gamma = \|x\|_1 \ne 0$ and $x' = x + 2\gamma\cdot\mathbf{1} > 0$, i.e., all $x'_j > 0$ have the same sign. It is obvious that
the average aggregate of $x'$ is a simple offset of the average aggregate of $x$, i.e., $x'_{ave} = x_{ave} + 2\gamma$.
Proceeding through the same data exchanging scenario in each round of the gossip-ave algorithm on $x$ and
$x'$, after $t$ rounds, at root node $z$, we have the relationship between the two corresponding estimates of the
average aggregates on $x'$ and $x$: $\hat{x}'_{ave,t,z} = \hat{x}_{ave,t,z} + 2\gamma$. The desired relative error is $\frac{|\hat{x}_{ave,t,z} - x_{ave}|}{|x_{ave}|} \le \varepsilon'$.
Let $\gamma = O(n^{\alpha})$ and a stricter threshold $\varepsilon = \frac{|x_{ave}|}{|x_{ave} + 2\gamma|}\varepsilon' < \varepsilon'$. As all $x'_j$ are of the same sign, whp at least
$1 - \delta'$, after $t = O(\log n + \log\frac{1}{\varepsilon} + \log\frac{1}{\delta'}) = O(\log n)$ rounds,

$$\frac{|\hat{x}'_{ave,t,z} - x'_{ave}|}{|x'_{ave}|} = \frac{|(\hat{x}_{ave,t,z} + 2\gamma) - (x_{ave} + 2\gamma)|}{|x_{ave} + 2\gamma|} \le \varepsilon = \frac{\varepsilon'|x_{ave}|}{|x_{ave} + 2\gamma|}.$$

From the above equation, we conclude that

$$\frac{|\hat{x}_{ave,t,z} - x_{ave}|}{|x_{ave}|} \le \varepsilon'.$$

That is to say, running the gossip-ave algorithm on an arbitrary vector $x$, whp at least $1 - \delta'$, after $t =
O(\log n + \log\frac{1}{\varepsilon} + \log\frac{1}{\delta'}) = O(\log n)$ rounds, the relative error of the estimate of the average aggregate is
less than $\varepsilon' = O(n^{-\alpha})$.

■

Evaluating the performance of the gossip-ave algorithm using the criterion of relative error causes a
problem when $x_{ave} = 0$, whereas the gossip-ave algorithm itself works well when $x_{ave} = 0$. In this case, using the
absolute error criterion, i.e., $|\hat{x}_{ave,t,z} - x_{ave}| = |\hat{x}_{ave,t,z}| \le \varepsilon'$, is more suitable. Here, we show that
the upper bound on the running time in Theorem 7 is also valid for the case that $x_{ave} = 0$ when the performance
is assessed under the absolute error criterion $|\hat{x}_{ave,t,z}| \le \varepsilon'$. By a similar technique as in the above proof,
choose an offset constant $\gamma = O(n^{\alpha}) > 1$ such that $x' = x + \gamma\cdot\mathbf{1} > 0$, i.e., all $x'_j > 0$ have the same
sign. Also, let $\varepsilon = \frac{\varepsilon'}{|x'_{ave}|} = \frac{\varepsilon'}{\gamma} < \varepsilon'$. Proceeding through the same data exchanging scenario in each round of
the gossip-ave algorithm on $x$ and $x'$, whp at least $1 - \delta'$, after $t = O(\log n + \log\frac{1}{\varepsilon} + \log\frac{1}{\delta'}) = O(\log n)$
rounds,

$$\frac{|\hat{x}'_{ave,t,z} - x'_{ave}|}{|x'_{ave}|} = \frac{|(\hat{x}_{ave,t,z} + \gamma) - \gamma|}{\gamma} \le \varepsilon = \frac{\varepsilon'}{\gamma}.$$

From the above equation, we have that $|\hat{x}_{ave,t,z}| \le \varepsilon'$. This concludes the mapping relationship between the
relative error criterion and the absolute error criterion.
Complexity of Gossip-ave 

The Gossip-ave algorithm needs $O(\log m + \log\frac{1}{\varepsilon}) = O(\log n)$ rounds and $m\cdot O(\log n) = O(n)$ messages
for the root of the largest tree to have the global average aggregate, Ave, within a relative error of at most
$\frac{2}{n^{\alpha}-1}$, $\alpha > 0$.


3.4 DRR-gossip Algorithms 

Putting together our results from the previous subsections, we present Algorithm 7, the DRR-gossip-max
algorithm, and Algorithm 8, the DRR-gossip-ave algorithm, for computing Max and Ave, respectively. To conclude
from the previous sections, the time complexity of DRR-gossip is $O(\log n)$ since all phases need $O(\log n)$
rounds. The message complexity is dominated by the DRR algorithm in Phase I, which needs $O(n\log\log n)$
messages.

The DRR-gossip-ave algorithm is more involved than the DRR-gossip-max algorithm. Unlike the Gossip-max
algorithm, which ensures that all the roots will have Max whp, the Gossip-ave algorithm only guarantees
that the root of the largest tree in terms of tree size will have the Ave whp. To ensure that all the roots have
Ave whp, after the Gossip-ave algorithm, the root of the largest tree has to spread out its estimate, the Ave,
by using the Data-spread algorithm, where the root of the largest tree sets its estimate, the Ave, computed
by the Gossip-ave algorithm, as the data to be spread out. Therefore, every root needs to know in advance
whether it is the root of the largest tree. To achieve this, the Gossip-max algorithm is executed beforehand on
the tree sizes, which are obtained from the Convergecast-sum algorithm. (Note that the Gossip-max procedure
in the DRR-gossip-max algorithm is executed on the local maximums computed by the Convergecast-max
algorithm.) Every root can compare the maximum tree size obtained from the Gossip-max algorithm with
the size of its own tree to recognize whether it is the root of the largest tree. (Note that the Gossip-max
algorithm and the Gossip-ave algorithm cannot be executed simultaneously, since the Gossip-ave algorithm
does not have the sampling procedure as in the Gossip-max algorithm.) Finally, every root broadcasts
the Ave obtained from the Data-spread algorithm to all its tree members.



Algorithm 7: DRR-gossip-max

Run DRR($G$) to obtain the forest $F$.
Run Convergecast-max($F$, $v$).
Run Gossip-max($G$, $F$, $\tilde{V}$, $covmax$).
Every root node broadcasts the Max to all nodes in its tree.



Algorithm 8: DRR-gossip-ave

Run the DRR($G$) algorithm to obtain the forest $F$.
Run the Convergecast-sum($F$, $v$) algorithm.
Run the Gossip-max($G$, $F$, $\tilde{V}$, $covsum(*, 2)$) algorithm on the sizes of trees to find the root of the largest
tree. At the end of this phase, a root $z$ will know that it is the one with the largest tree size.
Run the Gossip-ave($G$, $F$, $\tilde{V}$, $covsum$) algorithm.
Run the Data-spread($G$, $F$, $\tilde{V}$, Ave) algorithm: the root of the largest tree uses its average estimate, i.e.,
Ave, as the value to spread.
Every root broadcasts its value to all the nodes in its tree.
3.5 The complexity of DRR-gossip algorithms

To conclude from the previous sections, the time complexity of the DRR-gossip algorithms is $O(\log n)$ since
all the phases need $O(\log n)$ rounds. The message complexity is dominated by the DRR algorithm in
Phase I, which needs $O(n\log\log n)$ messages. Thus, our DRR-gossip algorithms achieve the same time
complexity as the uniform gossip of [9] but reduce the message complexity to $O(n\log\log n)$. Although the
efficient gossip of [8] can have the same message complexity, it needs $O(\log n\log\log n)$ time.

4 Application to Sparse Networks — Local-DRR Algorithm 

In sparse networks, a small number of neighbors makes it feasible for each node to send messages to all of
its neighbors simultaneously in one round. In fact, this is a standard assumption in the traditional message
passing distributed computing model [19] (here it is assumed that messages sent to different neighbors in one
round can all be different). We show how DRR-gossip can be used to improve gossip-based aggregate
computation in such networks.

We assume that, in a round of time, a node of an arbitrary undirected graph can communicate directly
only with its immediate neighbors (i.e., nodes that are connected directly by an edge). (Note that, in previous
sections, any two nodes can communicate with each other in a round under a complete graph model.) Thus,
on such a communication model, we have a variant of the DRR algorithm, called the Local-DRR algorithm,
where a node only exchanges rank information with its immediate neighbors. Each node chooses a random
rank in [0, 1] as before. Then each node connects to its highest ranked neighbor (i.e., the neighbor which
has the highest rank among all its neighbors). A node that has the highest rank among all its neighbors will
become a root. Since every node, except root nodes, connects to a node with higher rank, there is no cycle
in the graph. Thus this process results in a collection of disjoint trees. As shown in Theorem 11 below, the
key property is that the height of each tree produced by the Local-DRR algorithm on an arbitrary graph is
bounded by $O(\log n)$ whp. This enables us to bound the time complexity of Phase II of the DRR-gossip
algorithm, i.e., Convergecast and Broadcast, on an arbitrary graph by $O(\log n)$ whp.
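The Local-DRR rule described above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the graph is given as a hypothetical adjacency dictionary with integer node ids, and rank ties (which have probability zero for real-valued ranks) are broken by node id so that the parent relation is always acyclic.

```python
import random

def local_drr(adj, rng):
    """adj: dict node -> list of neighbors. Returns (rank, parent);
    parent[v] is None exactly when v is a root."""
    rank = {v: rng.random() for v in adj}
    order = lambda u: (rank[u], u)      # tie-break by id (an added assumption)
    parent = {}
    for v, nbrs in adj.items():
        best = max(nbrs, key=order, default=None)
        # v is a root if it outranks all of its neighbors; otherwise it
        # connects to its highest ranked neighbor.
        parent[v] = None if best is None or order(best) < order(v) else best
    return rank, parent

def tree_height(parent, v):
    """Number of hops from v to the root of its tree."""
    hops = 0
    while parent[v] is not None:
        v = parent[v]
        hops += 1
    return hops

if __name__ == "__main__":
    n = 50
    ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}  # 2-regular example
    rank, parent = local_drr(ring, random.Random(7))
    roots = [v for v in ring if parent[v] is None]
    print(len(roots), "roots, max height", max(tree_height(parent, v) for v in ring))
```

Every parent pointer leads to a strictly higher node in the (rank, id) order, so following pointers always terminates at a root, which is the acyclicity argument of the text.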

Theorem 11 On an arbitrary undirected graph, all the trees produced by the Local-DRR algorithm have a
height of at most $O(\log n)$ whp.

Proof: Fix any node $u_0$. We first show that the path from $u_0$ to a root is at most $O(\log n)$ whp. Let
$u_1, u_2, \ldots$ be the successive ancestors of $u_0$, i.e., $u_1$ is the parent of $u_0$ (i.e., $u_0$ connects to $u_1$), $u_2$ is the
parent of $u_1$ and so on. (Note that $u_1, u_2, \ldots$ are all null if $u_0$ itself is the root.) Define the complement value to
the rank of $u_i$ as $C_i := 1 - rank(u_i)$, $i \ge 0$. The main thrust of the proof is to show that the sequence $C_i$,
$i \ge 0$, decreases geometrically whp. We adapt a technique used in [18].

For $t \ge 0$, let $I_t$ be the indicator random variable for the event that a root has not been reached after $t$
jumps, i.e., $u_0, u_1, \ldots, u_t$ are not roots. We need the following lemma.

Lemma 12 For any $t \ge 1$ and any $z \in [0, 1]$, $E[C_{t+1}I_t \mid C_tI_{t-1} = z] \le z/2$.

Proof: We can assume that $z \ne 0$; since $C_{t+1} \le C_t$ and $I_t \le I_{t-1}$, the lemma holds trivially if $z = 0$.
Therefore, we have $I_{t-1} = 1$ and $C_t = z > 0$. We focus on the node $u_t$. Denote the set of neighbors of node
$u_t$ by $U$; the size of $U$ is at most $n - 1$. Let $Y$ be the random variable denoting the number of "unexplored"
nodes in set $U$, i.e., those that do not belong to the set $\{u_0, u_1, \ldots, u_{t-1}\}$. If $Y = 0$, then $u_t$ is a root and
hence $C_{t+1}I_t = 0$. We will prove that for all $d \ge 1$,

$$E[C_{t+1}I_t \mid ((C_tI_{t-1} = z) \wedge (Y = d))] \le z/2. \qquad (3)$$

Showing the above is enough to prove the lemma, because if the lemma holds conditional on all positive
values of $d$, it also holds unconditionally. For convenience, we denote the l.h.s. of (3) as $\Phi$.

Fix some $d \ge 1$. In all arguments below, we condition on the event "$(C_tI_{t-1} = z) \wedge (Y = d)$". Let
$v_1, v_2, \ldots, v_d$ denote the $d$ unexplored nodes in $U$. If $rank(v_i) < rank(u_t)$ for all $i$ ($1 \le i \le d$), then $u_t$ is a
root and hence $C_{t+1}I_t = 0$. Therefore, conditioning on the value $y = \min_i(1 - rank(v_i)) \le z$,
and considering the $d$ possible values of $i$ that achieve this minimum, we get

$$\Phi = d\int_0^z y(1-y)^{d-1}\,dy.$$

Evaluating the above yields

$$\Phi = \frac{1 - (1-z)^d(1 + zd)}{d+1}.$$

We can show that the r.h.s. of the above is at most $z/2$ by a straightforward induction on $d$.
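The closed form in the proof of Lemma 12 can be checked numerically. The sketch below assumes that the expression being evaluated is the expectation of the minimum of $d$ independent uniform values restricted to $[0, z]$, i.e., $d\int_0^z y(1-y)^{d-1}dy$; it compares a midpoint-rule approximation of that integral with $\frac{1-(1-z)^d(1+zd)}{d+1}$ and checks the $z/2$ bound on a small grid. The function names are hypothetical, and a grid check illustrates the bound without replacing the induction.

```python
def phi_closed_form(z, d):
    """(1 - (1 - z)^d (1 + z d)) / (d + 1), the evaluated expression."""
    return (1.0 - (1.0 - z) ** d * (1.0 + z * d)) / (d + 1)

def phi_numeric(z, d, steps=20000):
    """Midpoint-rule approximation of d * integral_0^z y (1 - y)^(d - 1) dy."""
    h = z / steps
    total = 0.0
    for k in range(steps):
        y = (k + 0.5) * h
        total += y * (1.0 - y) ** (d - 1)
    return d * total * h

if __name__ == "__main__":
    for d in (1, 2, 5, 20):
        for z in (0.1, 0.5, 1.0):
            c = phi_closed_form(z, d)
            print(d, z, round(c, 6), round(phi_numeric(z, d), 6), c <= z / 2)
```

For $d = 1$ the closed form reduces to $z^2/2$, which meets the bound $z/2$ with equality only at $z = 1$.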



Using Lemma 12, we now prove Theorem 11.

We have $E[C_1I_0] \le E[C_1] \le 1$. Hence, Lemma 12 and an induction on $t$ yield that $E[C_tI_{t-1}] \le 2^{-(t-1)}$.
In particular, letting $T = c\log n$, where $c$ is some suitable constant, we get $E[C_TI_{T-1}] \le n^{-3}$.

Now, suppose $u_T = u$ and that $C_TI_{T-1} = z$. The degree of node $u$ is at most $n$; for each of these nodes
$v$, $\Pr(rank(v) > rank(u)) = \Pr(1 - rank(v) < 1 - rank(u)) = \Pr(1 - rank(v) < z) = z$. Thus the
probability that $u$ is not a root is at most $nz$; more formally, $\forall z$, $\Pr(I_T = 1 \mid C_TI_{T-1} = z) \le nz$. So,

$$\Pr(I_T = 1) \le nE[C_TI_{T-1}] \le n/n^3 = 1/n^2.$$

Hence, whp, the number of hops from any fixed node to the root is $O(\log n)$. By the union bound, the
statement holds for all nodes whp. ■

Similar to Theorem 2, we can bound the number of trees produced by the Local-DRR algorithm on an
arbitrary graph.

Theorem 13 Let $G$ be an arbitrary connected undirected graph having $n$ nodes. Let $d_i = O(n/\log n)$ be
the degree of node $i$, $1 \le i \le n$. The number of trees produced by the Local-DRR algorithm is $O(\sum_{i=1}^{n}\frac{1}{d_i+1})$
whp. Hence, if $d_i = d$, $\forall i$, then the number of trees is $O(n/d)$ whp.

Proof: Let the indicator random variable $X_i$ take the value 1 if node $i$ is a root and 0 otherwise. Let
$X = \sum_{i=1}^{n} X_i$ be the total number of roots. $\Pr(X_i = 1) = 1/(d_i + 1)$, since this is the probability that its
rank is the highest among itself and all of its $d_i$ neighbors. Hence, by linearity of expectation, the expected number
of roots (hence, trees) is $E[X] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n}\frac{1}{d_i+1}$. To show concentration, we cannot directly
use a standard Chernoff bound since the $X_i$ are not independent (connections are not independently chosen,
but fixed by the underlying graph). However, one can use the following variant of the Chernoff bound from
[18] (cf. Lemma 1), which works in the case of dependent indicator random variables that are correlated as
defined below. For random variables $X_1, \ldots, X_i, \ldots, X_n$ and for any $S_{i-1} \subseteq \{1, \ldots, i-1\}$,
$\Pr(X_i = 1 \mid \bigwedge_{j\in S_{i-1}} X_j = 1) \le \Pr(X_i = 1)$. This is because if a node's neighbor is a root, then the probability that
the node itself is a root is 0. Also, the assumption of $d_i = O(n/\log n)$ ensures that $E[X]$ is $\Omega(\log n)$, so the
Chernoff bound yields a high probability on the concentration of $X$ to its mean. ■
We make two assumptions regarding the network communication model: (1) as mentioned earlier, a node
can send a message simultaneously to all its neighbors (i.e., nodes that are connected directly by an edge) in
the same round; (2) there is a routing protocol which allows any node to communicate with a random node
in the network in $O(T)$ rounds and using $O(M)$ messages whp. Assumption (1) is standard in the distributed
computing literature [2, 19]. As for Assumption (2), there are well-known techniques for sampling a random
node in a network, e.g., using random walks (e.g., [26]) or using special properties of the underlying topology,
e.g., as in P2P topologies such as Chord [10]. Under the above assumptions, we obtain the performance of
DRR-gossip using the Local-DRR algorithm on sparse graphs in the following theorem.

Theorem 14 On a $d$-regular graph $G(V, E)$, where $|V| = n$ and $d = O(n/\log n)$, the time complexity
of the DRR-gossip algorithms is $O(\log n + T\log\frac{n}{d})$ whp by using the Local-DRR algorithm and a routing
protocol running in $O(T)$ rounds and $O(M)$ messages (whp) between a gossip pair; the corresponding
message complexity is $O(|E| + \frac{n}{d}M\log\frac{n}{d})$ whp.
Proof: Phase I (Local-DRR) takes $O(1)$ time, since each node can find its largest ranked neighbor in
constant time (Assumption 1), and needs $O(|E|)$ messages in total (since at most two messages travel through
an edge). Phase II (convergecast and broadcast) takes $O(\log n)$ time (by Theorem 11 and Assumption 1) and
$O(n)$ messages. Phase III (uniform gossip) takes $O(T\log\frac{n}{d})$ time (Assumption 2) and needs $O(\frac{n}{d}M\log\frac{n}{d})$
messages (Assumption 2 and Theorem 13). ■
We can apply the above theorem to Chord [25]. Each node in Chord has a degree $d = O(\log n)$. Chord
admits an efficient (non-trivial) protocol (cf. [10]) which satisfies Assumption (2) with $T = O(\log n)$ and
$M = O(\log n)$ (both in expectation, which is sufficient here). Hence the above theorem shows that DRR-gossip
takes $O(\log^2 n)$ time and $O(n\log n)$ messages whp. In contrast, the straightforward uniform gossip
[9] gives $O(T\log n) = O(\log^2 n)$ rounds and $O(M\cdot n\log n) = O(n\log^2 n)$ messages whp.
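As a worked instance, the substitution behind the Chord figures can be spelled out. This is only a sketch under the assumptions stated in the text ($d = \Theta(\log n)$, $T = M = O(\log n)$), reading the bounds of Theorem 14 as $O(\log n + T\log\frac{n}{d})$ time and $O(|E| + \frac{n}{d}M\log\frac{n}{d})$ messages, and using $|E| = nd/2$ for a $d$-regular graph.

```latex
\begin{align*}
\text{time} &= O\Big(\log n + T\log\tfrac{n}{d}\Big)
             = O\Big(\log n + \log n\cdot\log\tfrac{n}{\log n}\Big)
             = O(\log^2 n),\\
\text{messages} &= O\Big(|E| + \tfrac{n}{d}\,M\log\tfrac{n}{d}\Big)
             = O\Big(n\log n + \tfrac{n}{\log n}\cdot\log n\cdot\log\tfrac{n}{\log n}\Big)
             = O(n\log n).
\end{align*}
```

The saving over uniform gossip comes from the second term: only about $n/d$ roots take part in the routed gossip, so the factor $M\log n$ is paid by $n/\log n$ participants instead of $n$.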

5 Lower Bound for Address-Oblivious Algorithms 

We conclude by showing a non-trivial lower bound result on gossip-based aggregate computation: any
address-oblivious algorithm for computing aggregates requires $\Omega(n\log n)$ messages, regardless of the number
of rounds or the size of the (individual) messages. We assume the random phone call model, i.e., communication
partners are chosen randomly (without depending on their addresses). The following theorem
gives a lower bound for computing the Max aggregate. The argument can be adapted for other aggregates as
well.

Theorem 15 Any address-oblivious algorithm that computes the maximum value, Max, in an $n$-node network
needs $\Omega(n\log n)$ messages whp (regardless of the number of rounds).

Proof: We lower bound the number of messages exchanged between nodes before a large fraction of
the nodes correctly knows the (correct) maximum value. Suppose nodes can send messages that are arbitrarily
long. (The bound will hold regardless of this assumption.) Without loss of generality, we will assume that
a node can send a list of all node addresses and the corresponding node values learned so far (without any
aggregation). For any node $i$ to have correct knowledge of the maximum, it should somehow know the values
at all other nodes. (Otherwise, an adversary who knows the random choices made by the algorithm can
always make sure that the maximum is at a node which is not known by $i$.) There are two ways that $i$ can
learn about another node $j$'s value: (1) the direct way: $i$ gets to know $j$'s value by communicating with $j$ directly
(at the beginning, each node knows only about its own value); and (2) the indirect way: $i$ gets to know $j$'s value
by communicating with a node $w \ne j$ which has knowledge of $j$'s value. Note that $w$ itself may have
learned about $j$'s value either directly or indirectly.

Let $v_i$ be the (initial) value associated with node $i$, $1 \le i \le n$. We will assume that all values are distinct.
By the adversary argument, the requirement is that at the end of any algorithm, on the average, at least half of
the nodes should know (in the above direct or indirect way) all of the $v_i$, $1 \le i \le n$. Otherwise, the adversary
can make a value that is not known to more than half of the nodes the maximum. We want to show that the
number of messages needed to satisfy the above requirement is at least $cn\log n$, for some (small) constant
$c > 0$. In fact, we show something stronger: at least $cn\log n$ (for some small $c > 0$) messages are needed if
we require even $n^{\Omega(1)}$ values to be known to at least $\Omega(n)$ nodes.

We define a stage (consisting of one or more rounds) as follows. Stage 1 starts with round 1. If stage
t ends in round j, then stage t + 1 starts in round j + 1. Thus, it remains to describe when a stage ends.
We distinguish sparse and dense stages. A sparse stage contains at most $\epsilon n$ messages (for a suitably chosen
small constant $\epsilon > 0$, fixed later in the proof). The length of these stages is maximized, i.e., a sparse stage
ends in a round j if adding round j + 1 to the stage would result in more than $\epsilon n$ messages. A dense stage
consists of only one round containing more than $\epsilon n$ messages. Observe that the number of messages during
the stages 1 to j is at least $(j - 1)\epsilon n/2$ because any pair of consecutive stages contains at least $\epsilon n$ messages
by construction.
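The stage rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_into_stages`, the 0-based round indices, and the input format (a list of per-round message counts plus the threshold, written `eps_n`) are hypothetical and not from the paper.

```python
def split_into_stages(round_msgs, eps_n):
    """Group rounds into stages following the sparse/dense rule in the text.

    round_msgs[r] is the number of messages sent in round r (0-based here).
    Returns a list of (kind, rounds) pairs with kind in {"sparse", "dense"}.
    """
    stages = []
    r = 0
    while r < len(round_msgs):
        if round_msgs[r] > eps_n:
            # A single round with more than eps_n messages is a dense stage.
            stages.append(("dense", [r]))
            r += 1
            continue
        # Sparse stage: extend greedily while the total stays at most eps_n.
        total, rounds = 0, []
        while r < len(round_msgs) and total + round_msgs[r] <= eps_n:
            total += round_msgs[r]
            rounds.append(r)
            r += 1
        stages.append(("sparse", rounds))
    return stages

if __name__ == "__main__":
    print(split_into_stages([3, 4, 2, 12, 5, 5, 1], 10))
```

Because a sparse stage is extended until the next round would push it past the threshold, a sparse stage together with the stage that follows it always carries at least the threshold number of messages, which is the pairing property used in the counting argument.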

Let $S_i(t)$ be the set of nodes that know $v_i$ at the beginning of stage t. At the beginning of stage 1,
$|S_i(1)| = 1$, for all $1 \le i \le n$.
At the beginning of stage t, we call a value typical if it is known by at most $6^t \log n$ nodes (i.e.,
$|S_i(t)| \le 6^t \log n$) and it was typical at the beginning of all stages prior to t. All values are typical at the
beginning of stage 1. Let $k_t$ denote the number of typical values at the beginning of stage t.

The proof of the Theorem follows from the following claim. (Constants specified will be fixed in the 
proof; we don't try to optimize these values). 

Claim: At the beginning of stage t, at least $(1/6)^t n$ values are typical w.h.p., for all $t \le \delta \log n$, for a
fixed positive constant $\delta$.

The above claim implies the theorem: at the end of stage $t = \delta \log n$, $|S_i(t)| = o(n)$ for at
least $n^{\Omega(1)}$ values, i.e., at least $n^{\Omega(1)}$ values are not yet known to a $1 - o(1)$ fraction of the nodes after stage
$t = \delta \log n$. Since stages 1 to j contain at least $(j - 1)\epsilon n/2$ messages, the number of messages needed is at least $\Omega(n \log n)$.
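As a worked check of the exponents in this step, take the typical threshold to be $6^t \log n$ and the number of typical values guaranteed by the claim to be $(1/6)^t n$, and substitute $t = \delta \log n$. The choice of base-2 logarithms below is an assumption made only to keep the arithmetic explicit; any fixed base changes only the constant that $\delta$ must stay below.

```latex
\begin{align*}
6^{t} \log n &= 6^{\delta \log_2 n} \log n = n^{\delta \log_2 6} \log n = o(n)
  && \text{if } \delta \log_2 6 < 1, \\
(1/6)^{t}\, n &= n^{1 - \delta \log_2 6} = n^{\Omega(1)}
  && \text{if } \delta \log_2 6 < 1, \\
\#\{\text{messages in stages } 1, \dots, t\} &\ge \frac{(t - 1)\,\epsilon n}{2}
  = \frac{(\delta \log_2 n - 1)\,\epsilon n}{2} = \Omega(n \log n).
\end{align*}
```

So any constant $\delta < 1/\log_2 6$ keeps polynomially many values unknown to almost all nodes throughout the first $\delta \log n$ stages, and those stages already account for $\Omega(n \log n)$ messages.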

We prove the above claim by induction: we show that if the claim holds at the beginning of a stage, then
it holds at the end of the stage. We show this regardless of whether the stage is dense or sparse, and thus we have
two cases.

Case 1: The stage is dense. A dense stage consists of only one round with more than $\epsilon n$ messages. Fix a typical
value $v_i$. Let $U_i(t) = V - S_i(t)$, i.e., the set of nodes that do not know $v_i$ at the beginning of stage t. For
$1 \le k(i) \le |U_i(t)|$, let $x_{k(i)}$ denote the indicator random variable that denotes whether the $k(i)$th of these
nodes gets to know the value $v_i$ in this stage. Let $X_i(t) = \sum_{k(i)=1}^{|U_i(t)|} x_{k(i)}$. Let u be a node that does not know
$v_i$. Then u can get to know $v_i$ either by calling a node that knows the value or by being called by a node that knows the
value. The probability that u gets to know $v_i$ by calling is at most $6^t \log n / n$, and the probability that it gets called
by a node knowing the value is at most $6^t \log n / n$ (this quantity is o(1), since $t \le \delta \log n$ and $\delta$ is sufficiently
small). Hence the total probability that u gets to know $v_i$ is at most $2 \cdot 6^t \log n / n$. Thus, the expected number
of nodes that get to know $v_i$ in this stage is $E[X_i(t)] = \sum_{k(i)=1}^{|U_i(t)|} \Pr(x_{k(i)} = 1) \le 2 \cdot 6^t \log n$. The variables $x_{k(i)}$
are not independent, but are negatively correlated in the sense of Lemma |T] and using the Chernoff 
bound of this Lemma we have: 

$$\Pr(X_i(t) > 5 \cdot 6^t \log n) = \Pr(X_i(t) > (1 + 3/2) \cdot 2 \cdot 6^t \log n) \le 1/n^2.$$

By the union bound, w.h.p., at most $5 \cdot 6^t \log n$ new nodes get to know each typical value. Thus w.h.p. the total
number of nodes knowing a typical value (for every such value) at the end of this stage is at most
$6^t \log n + 5 \cdot 6^t \log n = 6^{t+1} \log n$, thus satisfying the induction hypothesis. It also follows that a typical value
at the beginning of a dense stage remains typical at the end of the stage, i.e., $k_{t+1} = k_t$ w.h.p.
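The expectation bound in Case 1 can be sanity-checked numerically. The sketch below is an illustration under stated assumptions, not part of the proof: it models the densest possible round (every node calls one uniformly random partner, and learning happens in both directions), lets `s` stand for the number of nodes that know the value at the start of the round (at most $6^t \log n$ for a typical value), and uses hypothetical function names.

```python
import random

def new_learners_in_dense_round(n, s, rng):
    """One push-pull round in which every node calls a uniform random partner.

    Nodes 0..s-1 know the value at the start of the round. Returns how many
    other nodes learn it during the round, either by calling a knower or by
    being called by one.
    """
    knows = [i < s for i in range(n)]  # fixed start-of-round knowledge
    learned = set()
    for u in range(n):
        v = rng.randrange(n)  # u's uniformly random partner
        if knows[v] and not knows[u]:
            learned.add(u)  # u called a node that knows the value
        if knows[u] and not knows[v]:
            learned.add(v)  # v was called by a node that knows the value
    return len(learned)

def worst_over_trials(n, s, trials, seed=0):
    """Largest number of new learners seen over independent simulated rounds."""
    rng = random.Random(seed)
    return max(new_learners_in_dense_round(n, s, rng) for _ in range(trials))

if __name__ == "__main__":
    n, s = 1000, 20
    print(worst_over_trials(n, s, trials=50), "vs. bound", 5 * s)
```

In expectation roughly s nodes learn the value by calling and at most s by being called, so the count concentrates near 2s, and exceeding 5s in a single round is what the Chernoff bound rules out with high probability.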

Case 2: The stage is sparse. By definition, there are at most $\epsilon n$ messages in a sparse stage. Each of these
messages can be a push or a pull. A sparse stage may consist of multiple rounds.

Fix a typical value $v_i$. W.h.p., there are at most $6^t \log n$ nodes that know a typical value at the beginning
of this stage. Using pull messages, since the origin is chosen uniformly at random, the probability that one
of these nodes is contacted is at most $(1/n)(\epsilon n) = \epsilon$. Hence the expected number of messages sent by nodes
knowing this typical value is at most $\epsilon 6^t \log n$. Thus the expected number of new nodes that get to know this
typical value is at most $\epsilon 6^t \log n$. The high probability bound can be shown as earlier.

We next consider the effect of push messages. We focus on values that are typical at the beginning of this
stage. We show that with high probability at least some constant fraction of the typical values remain typical at
the end of this stage. As defined earlier, let $k_t$ be the number of such typical values. In this stage, at most $\epsilon n$
nodes are involved in pushing; let this set be Q. Consider a random typical value x. Since a typical value
is known by at most $6^t \log n$ nodes and destinations are chosen uniformly at random, the probability that x is
known to a node in Q is $O(6^t \log n / n)$. Hence the expected number of times that x will be pushed by set Q is
at most $O(\epsilon 6^t \log n)$. Now, the number of times x has to be pushed is at least $(6 - \epsilon) \cdot 6^t \log n$ to exceed the
required expansion for this value w.h.p. (as argued in the preceding paragraph, pulling results in at most $\epsilon 6^t \log n$
messages being sent out w.h.p.). By Markov's inequality, the probability that x is pushed more than
$(6 - \epsilon) \cdot 6^t \log n$ times by nodes in set Q is at most $O(\epsilon/(6 - \epsilon))$. Hence the expected number of typical values that can
expand is at most $O(\epsilon/(6 - \epsilon)) \cdot k_t$. Thus, in expectation, at least a $1 - O(\epsilon/(6 - \epsilon))$ fraction of the typical values remain typical.
The high probability bound can be shown as in Case 1. We want $1 - O(\epsilon/(6 - \epsilon)) \ge 1/6$ for the induction hypothesis
to hold; this can be satisfied by choosing $\epsilon$ small enough.



6 Concluding Remarks 

We presented an almost-optimal gossip-based protocol for computing aggregates that takes O(nloglogn) 
messages and O(logn) rounds. We also showed how our protocol can be applied to improve performance 
in networks with a fixed underlying topology. The main technical ingredient of our approach is a simple 
distributed randomized procedure called DRR to partition a network into trees of small size. The improved 
bounds come at the cost of sacrificing address-obliviousness. However, as we show in our lower bound, 
this is necessary if we need to break the $\Omega(n \log n)$ message barrier. An interesting open question is to
establish whether $\Omega(n \log \log n)$ messages is a lower bound for gossip-based aggregate computation in the
non-address oblivious model. Another interesting direction is to see whether the DRR technique can be used
to obtain improved bounds for other distributed computing problems.